
State of the art non-linear FETCH


MitchAlsup

5 Oct 2023, 18:35:20
I am looking for modern papers on the process of fetching
non-linear sets of instructions where the first instructions
in an ISSUE group come from a sequential long fetch and
other instructions arrive from some other mechanism {such
as a branch target cache}.

In 1991 I did what we called a packet cache FETCH mechanism.
Instructions which were not in the packet cache were fetched
linearly from the unified cache, and when a branch was encountered,
the branch target was also then fetched and the instructions
were pasted together and then placed en masse into the packet
cache. The instructions were pre-routed to their function unit and
register fields were expanded by 1 bit so we could directly annotate
intra-packet forwarding, use-once results, and a few other things.
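To make that concrete, a packet entry held roughly the following kinds of
fields. This C sketch is only illustrative; the field names and widths are
stand-ins, not the actual 1991 layout:

#include <stdint.h>

/* Illustrative packet-cache entry: 6 pre-routed slots, register fields
   widened by 1 bit so a register number can instead name an
   intra-packet forwarding source; all widths are stand-ins. */
typedef struct {
    uint8_t  fu;          /* function unit this slot was pre-routed to  */
    uint8_t  dst;         /* destination; extra bit marks packet-local  */
    uint8_t  src1, src2;  /* sources, with the same 1-bit extension     */
    uint8_t  use_once;    /* result dies inside the packet              */
    uint32_t raw;         /* the original instruction bits              */
} PacketSlot;

typedef struct {
    uint64_t   entry_ip;  /* address the packet was built from          */
    uint8_t    count;     /* 1..6 valid slots (we averaged ~3.6)        */
    PacketSlot slot[6];
} Packet;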

In 2000 I did what we called a Trace Cache FETCH mechanism.
Unhappy with the instruction packing density of the packet cache
above (3.6 instructions per packet which holds 6) and here faced
with the myriad of x86-64-isms and the extreme speed (5GHz
in 65nm) we chose to fetch 8 instructions per cycle and issue
4 on the first clock and 4 on the second clock. We played a bunch
of routing games (sacrificing the vertical neighbor instruction to
be an associated long constant, and other packing games...).
This system worked rather well but still sucked for instruction
packing density. However, comparing the packet cache with
6-instructions and the trace cache with 8:: the trace cache needs
2 branch predictions per cycle while the packet cache only needs
1 branch prediction per cycle.

So, faced with a 6-8-wide issue machine, and wanting to avoid
having to pre-route instructions to FU and further wanting to
accept having several instructions target a single FU and still
be issued together in a single cycle, neither of the above mechanisms
looks like it would have traction.

Now, remember this is for a RISC machine with variable length
instructions (but only constants extend the length of an instruction),
so slightly different variables are in play.

a) So to continuously issue 6-I/C I probably have to FETCH 8
on a linear path
b) fetch from a branch target buffer for a first alternate path
set of instructions
c) fetch from a branch target target buffer for a second
alternate path set of instructions
d) there is going to be some kind of buffer between issue and
the reservation stations -- so, when a stream of instructions
all target the FMAC unit, they get issued 6 at a time and we
"go find" subsequent instructions not targeting the FMAC unit.
e) ½ of issue width will have a memory unit (3 on the 6-wide;
4 on the 8-wide) so there will be sufficient memory bandwidth
for the 30% memory instruction density of the instruction stream
and for those cases where it is higher.

f) instruction scheduling will be more like that of a Scoreboard
than of a reservation station so I can control the time of start
and the time of completion independently; but registers will
be completely renamed making this choice a sequencing issue
rather than a correctness issue.

g) there will be a set of Memory Buffers that are used to solve
memory ordering problems obeying the nuanced rules of memory
order where one end has causal order and the other end has strong
ordering, and this buffer is designed to perform memory references
based on the order specified in the TLB and whether ATOMIC stuff
is happening.

h) the overall datapath is designed to perform MATRIX300 at
1M×1M as fast as the memory system can deliver data, and
in particular be stall free for matrices which fit in the L2 cache.
When we did this in 1991 we were getting 5.9 I/C for MATRIX300
and the packet cache was getting 2.2 for XLISP ... without any
SIMD stuff.

It is easy to fetch enough linear instructions so that when there
are no branches one can issue 6-8-wide per cycle continuously.
On the other hand, when branches ARE present the game changes
completely. Where do you fetch the instructions at the target ?
How did you get an address of the branch target so you could
fetch there ?? How do you know how many instructions from the
sequential path get issued before instructions on the non-sequential
path(s) get issued ???

Anyone have pointers to current literature ?

Mitch

Quadibloc

5 Oct 2023, 22:01:22
On Thursday, October 5, 2023 at 4:35:20 PM UTC-6, MitchAlsup wrote:

> On the other hand, when branches ARE present the game changes
> completely. Where do you fetch the instructions at the target ?
> How did you get an address of the branch target so you could
> fetch there ?? How do you know how many instructions from the
> sequential path get issued before instructions on the non-sequential
> path(s) get issued ???

For starters, I doubt very much any of the literature would use a term
like "non-linear fetch" to refer to this. Because "non-linear" is usually
used when the basic problem is that one is dealing with x^2 or worse
instead of ax+b, and this is a completely different problem domain.

What you seem to be asking about is current literature on minimizing the
costs associated with branching.

What I've understood is that the current state of the art is:

- if branches are unconditional, then just calculate the branch target
as far ahead of time as possible in the pipeline (i.e. using interlocks
to be sure the base register doesn't change out from under you)...

- in the case of conditional branches, speculatively doing the branch
the "most likely" way (branch prediction) or allocating extra registers
to use spare superscalar and SMT capacity to do both paths at once
has been the technique... with worrying about Meltdown and Spectre
and company the latest wrinkle.

And then there's predication as another useful technique.

Has anything _new_ been discovered to help with code with branches?
That's been described in the open literature?

That's a good question, and I don't know the answer.

John Savard

MitchAlsup

5 Oct 2023, 22:32:41
On Thursday, October 5, 2023 at 9:01:22 PM UTC-5, Quadibloc wrote:
> On Thursday, October 5, 2023 at 4:35:20 PM UTC-6, MitchAlsup wrote:
>
> > On the other hand, when branches ARE present the game changes
> > completely. Where do you fetch the instructions at the target ?
> > How did you get an address of the branch target so you could
> > fetch there ?? How do you know how many instructions from the
> > sequential path get issued before instructions on the non-sequential
> > path(s) get issued ???
> For starters, I doubt very much any of the literature would use a term
> like "non-linear fetch" to refer to this. Because "non-linear" is usually
> used when the basic problem is that one is dealing with x^2 or worse
> instead of ax+b, and this is a completely different problem domain.
>
> What you seem to be asking about is current literature on minimizing the
> costs associated with branching.
>
> What I've understood is that the current state of the art is:
>
> - if branches are unconditional, then just calculate the branch target
> as far ahead of time as possible in the pipeline (i.e. using interlocks
> to be sure the base register doesn't change out from under you)...
<
Ok, I know this, but how do you calculate the address of the branch target
in the same cycle you fetch the group of instructions containing that branch ??
That is, you are 2-4 cycles in front of where you see the branch as an instruction
and its displacement. So, you have to somehow predict the branch target addr
based on nothing but the address you are currently fetching.
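That is, something like a next-fetch predictor indexed solely by the current
fetch-block address, trained later once the branch is actually decoded and
resolved. A toy C sketch of the idea (table size, fields, and the training
interface are placeholders, not a real design):

#include <stdbool.h>
#include <stdint.h>

#define NFP_ENTRIES 1024

typedef struct {
    uint64_t tag;        /* fetch-block address that owns this entry   */
    uint64_t target;     /* predicted address of the next fetch block  */
    uint8_t  seq_words;  /* words consumed from this block if redirect */
    bool     valid;
} NfpEntry;

static NfpEntry nfp[NFP_ENTRIES];

/* Predict the next fetch address using only the current fetch address;
   fall back to the sequential block when there is no entry. */
uint64_t predict_next_fetch(uint64_t fetch_ip, unsigned block_bytes)
{
    unsigned idx = (unsigned)((fetch_ip / block_bytes) % NFP_ENTRIES);
    if (nfp[idx].valid && nfp[idx].tag == fetch_ip)
        return nfp[idx].target;
    return fetch_ip + block_bytes;
}

/* Train when the branch in the block is finally decoded/resolved. */
void train_next_fetch(uint64_t fetch_ip, uint64_t actual_next,
                      unsigned block_bytes, uint8_t words_used)
{
    unsigned idx = (unsigned)((fetch_ip / block_bytes) % NFP_ENTRIES);
    nfp[idx] = (NfpEntry){ fetch_ip, actual_next, words_used, true };
}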
<
And the conditionality does not matter as much, since you can bet-both-ways
in the case of a conditional--set yourself up for when the conditional is not
taken, set yourself up for when it is taken, and then
predict one course of events or the other.
>
> - in the case of conditional branches, speculatively doing the branch
> the "most likely" way (branch prediction) or allocating extra registers
> to use spare superscalar and SMT capacity to do both paths at once
> has been the technique... with worrying about Meltdown and Spectre
> and company the latest wrinkle.
<
I have already figured out how to do the register stuff that deals with
1-way, that-way, and the other way.

robf...@gmail.com

5 Oct 2023, 22:53:57
How about using absolute addresses for branches? No need to calculate the
target then; just load the next PCs. The target addresses could be placed in an
address-specifying instruction before the branch. If fetching is going to be
done along multiple streams of instructions at the same time, the branch
outcome does not need to be known until a final select is done.

The reference to non-linear instruction fetches made me think of several
weird forms of instruction fetch such as instructions placed on the surface
of a sphere, or in some space manifold, organized like a spreadsheet
perhaps. I guess fetching along two different paths would be a step function.

Anyways, I am reminded of GPU branch-and-merge. I am also reminded of an
approach to superscalar operation that worked on basic-blocks in parallel.


Thomas Koenig

6 Oct 2023, 11:59:32
MitchAlsup <Mitch...@aol.com> schrieb:
> I am looking for modern papers on the process of fetching
> non-linear sets of instructions where the first instructions
> in an ISSUE group come from a sequential long fetch and
> other instructions arrive from some other mechanism {such
> as a branch target cache}.

Here's one which pops up on Googling:

https://webs.um.es/jlaragon/papers/aragon_ACSAC05.pdf , which is
not particularly new. US9367471B2 cites it. Then there is
http://aperais.fr/papers/elastic_instruction_fetching_hpca19.pdf or
https://arxiv.org/abs/2102.01764 , but I do not really know enough
about the subject to judge if they are any good or not.

MitchAlsup

6 Oct 2023, 13:22:12
Thank you for these references.

MitchAlsup

6 Oct 2023, 20:32:28
On Friday, October 6, 2023 at 4:41:24 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > I am looking for modern papers on the process of fetching
> > non-linear sets of instructions where the first instructions
> > in an ISSUE group come from a sequential long fetch and
> > other instructions arrive from some other mechanism {such
> > as a branch target cache}.
<
> It sounds like you are thinking about prefetching a cluster
> (I'll call it that) of temporally and probabilistically correlated
> instruction cache lines at a set of virtual addresses,
> either prescribed in advance (a prefetch set) or deduced from
> recent execution history.
>
> Correct?
<
You may know more about it than I right now !!!
<
I see code this way:: You arrive at a known point (say subroutine entry)
and then you know that there are sequences of instructions (IP++),
connected by a web of branches. So, it seems to me that one has to
be able to fetch the sequential part through the most efficient instruction
delivery means (the I cache) and then be in a position to profit from providing
more instructions from a predictable 'alternate' path.
<
{{So, I am seeing what you call a cluster as a linear sequence of instructions.}}
<
These can be fetched with wide width (8-words/cycle), and are probabilistically
prone to branches (every 10.86 words one "takes" a branch of some form.) An
I cache and some buffering takes care of this part of Instruction fetching.
<
Remembering that we want to keep the fetch-insert pipeline short::
<
We have 2 strategies:: use the current fetch IP twice 1) for I cache 2) for ALT
cache. Here you have instructions associated with something other than their own address,
so coherence is a bit tricky. The other strategy takes IP for accessing I cache
and takes something else for accessing ALT cache.
<
We are going to want the degenerate case of an infinite sequence of instructions to be
completely supplied by I cache; so ALT cache provides some words (4 ?!?)
at the target and adds some way to perform an ALT fetch next time we get here.
Thus, the I cache should be on the order of 2×-4× the size of the ALT cache--and for
all intents and purposes the ALT cache can use the same TLB entry the I cache uses.
<
So between FETCH and DECODE is a layer that applies branch prediction
to the stream from I buffer and ALT cache providing 6-instructions to
decode. We know how the I cache + I buffer sequencer works, so we have
to find a way of getting the instructions to the point of choosing efficiently,
and in a way that keeps working efficiently across cycles, too.
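As a strawman, the per-cycle choose step might look like the C sketch below.
All the widths (8 sequential words in, 4 ALT words in, 6 out) and the
predictor interface are placeholders:

#include <stdint.h>

typedef struct {
    uint32_t seq[8];   /* words streamed from the I cache / I buffer    */
    uint32_t alt[4];   /* words supplied by the ALT cache at the target */
} FetchBundle;

/* seq_used = how many sequential words the predictor says are consumed
   before the predicted-taken branch; the rest of the group then comes
   from the ALT words.  Returns the words handed to PARSE/DECODE. */
int choose(const FetchBundle *fb, int seq_used, int predicted_taken,
           uint32_t out[6])
{
    int n = 0;
    for (int i = 0; i < seq_used && n < 6; i++)
        out[n++] = fb->seq[i];
    if (predicted_taken) {
        for (int i = 0; i < 4 && n < 6; i++)
            out[n++] = fb->alt[i];
    } else {
        for (int i = seq_used; i < 8 && n < 6; i++)
            out[n++] = fb->seq[i];
    }
    return n;
}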

EricP

7 Oct 2023, 11:50:13
MitchAlsup wrote:
> On Friday, October 6, 2023 at 4:41:24 PM UTC-5, EricP wrote:
>> MitchAlsup wrote:
>>> I am looking for modern papers on the process of fetching
>>> non-linear sets of instructions where the first instructions
>>> in an ISSUE group come from a sequential long fetch and
>>> other instructions arrive from some other mechanism {such
>>> as a branch target cache}.
> <
>> It sounds like you are thinking about prefetching a cluster
>> (I'll call it that) of temporally and probabilistically correlated
>> instruction cache lines at a set of virtual addresses,
>> either prescribed in advance (a prefetch set) or deduced from
>> recent execution history.
>>
>> Correct?
> <
> You may know more about it than I right now !!!
> <
> I see code this way:: You arrive at a known point (say subroutine entry)
> and then you know that there are sequences of instructions (IP++),
> connected by a web of branches. So, it seems to me that one has to
> be able to fetch the sequential part through the most efficient instruction
> deliver means (the I cache) and then be in a position to profit from providing
> more instructions from a predictable 'alternate' path.
> <
> {{So, I am seeing what you call a cluster as a linear sequence of instructions.}}

A linear sequence is a subset of it. I saw a cluster as potentially being a
spatially distributed set of cache lines that are likely to be touched
within a short temporal window.

Like a simd gather operation but for instructions, such that no matter
what pathways the program follows, when Fetch comes to build your uOp
packet it does not stall and can likely fill all 8 uOp lane slots.
For example

if (cond)
Foo (a,b,c);
... more code
else
Bar (d,e,f);

A prefetch cluster could consist of the Virtual Address Ranges (VARs)
covering the IF and call to Foo (VAR1), the alternate call to Bar (VAR2),
plus some amount of code at the entry to each of Foo (VAR3) and Bar (VAR4).

In order to fetch this code without stalling, it would have to prefetch
the virtual address translations for each possible address path VAR1..VAR4
into D$L1, possibly also setting the Accessed bit on PTE's,
load I-TLB entries, and prefetch cache lines of each range into I$L1.

On entry to either Foo or Bar a new prefetch cluster starts for that path.
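In C terms a cluster descriptor might be little more than a short list of
ranges plus the prefetch state of each, something like this (purely
illustrative, not a worked-out design):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t va_start;      /* start of the range                     */
    uint32_t length;        /* bytes covered                          */
    bool     tlb_loaded;    /* I-TLB entry prefetched                 */
    bool     lines_in_l1;   /* cache lines prefetched into I$L1       */
} VaRange;

typedef struct {
    uint64_t trigger_ip;    /* reaching this address arms the cluster */
    int      nranges;
    VaRange  range[4];      /* e.g. VAR1..VAR4 in the Foo/Bar example */
} PrefetchCluster;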

> <
> These can be fetched with wide width (8-words/cycle), and are probabilistically
> prone to branches (every 10.86 words one "takes" a branch of some form.) An
> I cache and some buffering takes care of this part of Instruction fetching.
> <
> Remembering that we want to keep the fetch-insert pipeline short::
> <
> We have 2 strategies:: use the current fetch IP twice 1) for I cache 2) for ALT
> cache. Here you have instructions associated distinct from their own address
> so coherence is a bit tricky. The other strategy takes IP for accessing I cache
> and takes something else for accessing ALT cache.

I don't know what this ALT cache is. I was just thinking of how to
prefetch the normal I$L1 with primary and alternate pathways so that
Fetch doesn't stall on alternate branches and calls.

robf...@gmail.com

7 Oct 2023, 14:46:25
I think the idea is that the I$ and ALT-I$ are accessed at the same time which requires
two caches. Or one cache with two ports. I$ being the primary path and ALT-I$ being
for the alternate path. This is like fine-grained SMT except only a single thread is
present.

I think it would be very difficult to get multiple streams of execution going at the same
time. The primary stream I$ is likely to use up most of the available fetch spots. What
if there is another branch in ALT-I$? Is there going to be yet another ALT2-I$ cache?
How many alternate branch paths are going to be supported?

If the core supports SMT too then there may be more I$ cache ports available to
support multiple branch paths.

robf...@gmail.com

7 Oct 2023, 18:18:53
On Saturday, October 7, 2023 at 4:31:36 PM UTC-4, MitchAlsup wrote:
> ALT-cache is not necessarily indexed/addressed like the I cache. This, indeed, does
> make it necessarily another cache.
> >
> > I think it would be very difficult to get multiple streams of execution going at the same
> > time.
> <
> One of the things I am explicitly NOT trying to do is SMT. Solving Spectre and Meltdown
> pretty much makes SMT µArchitectures untenable (for safety) anyway. If code has access
> to a clock able to count instructions, you can't share caches, MMU, TLB, I/O Buffers,
> predictors, prefetchers,...
> <
> > The primary stream I$ is likely to use up most of the available fetch spots. What
> > if there is another branch in ALT-I$? Is there going to be yet another ALT2-I$ cache?
> > How many alternate branch paths are going to be supported?
> <
> I expect the I cache to feed the sequential nature of instruction flow and the ALT cache
> to service the excursions from the sequential flow. What I don't want is to add cycles
> between the fetches and the delivery to instruction queues. My 1-wide machine has
> 2 cycles between instruction being flopped after SRAM sense amplifiers and being
> issued into execution. I want this 6-wide machine to have only 1 more::
> <
> FETCH--BUFFER--CHOOSE--DECODE--ISSUE--Execute--CACHE --ALIGN--WAIT --WRITE
> >
> The output of Choose is 6 instructions,
> The output of Issue is 6 instructions with register and constant operands at their
> .....respective instruction queue.

At some point a selection between I$ and ALT-I$ clusters must be made. Is this the purpose
of choose? I would think the choice would be made after EXECUTE once the branch
outcome is determined.
> <
> > If the core supports SMT too then there may be more I$ cache ports available to
> > support multiple branch paths.
> <
> SMT is so 2001.

I take that as encouragement; I am less than 20 years behind the times.

robf...@gmail.com

7 Oct 2023, 18:21:45
On Saturday, October 7, 2023 at 5:20:33 PM UTC-4, EricP wrote:
> >> On entry to either Foo or Bar a new prefetch cluster starts for that path..
> >>> <
> >>> These can be fetched with wide width (8-words/cycle), and are probabilistically
> >>> prone to branches (every 10.86 words one "takes" a branch of some form.) An
> >>> I cache and some buffering takes care of this part of Instruction fetching.
> >>> <
> >>> Remembering that we want to keep the fetch-insert pipeline short::
> >>> <
> >>> We have 2 strategies:: use the current fetch IP twice 1) for I cache 2) for ALT
> >>> cache. Here you have instructions associated distinct from their own address
> >>> so coherence is a bit tricky. The other strategy takes IP for accessing I cache
> >>> and takes something else for accessing ALT cache.
> >> I don't know what this ALT cache is. I was just thinking of how to
> >> prefetch the normal I$L1 with primary and alternate pathways so that
> >> Fetch doesn't stall on alternate branches and calls.
> >
> > I think the idea is that the I$ and ALT-I$ are accessed at the same time which requires
> > two caches. Or one cache with two ports. I$ being the primary path and ALT-I$ being
> > for the alternate path. This is like fine-grained SMT except only a single thread is
> > present.
> >
> > I think it would be very difficult to get multiple streams of execution going at the same
> > time. The primary stream I$ is likely to use up most of the available fetch spots. What
> > if there is another branch in ALT-I$? Is there going to be yet another ALT2-I$ cache?
> > How many alternate branch paths are going to be supported?
> >
> > If the core supports SMT too then there may be more I$ cache ports available to
> > support multiple branch paths.
> Maybe its an exercise in keeping cache miss buffers busy.
> Say I$ and D$ L1 each have 8 miss buffers. 8 I-TLB page table walkers could
> keep D$ busy so it could be the equivalent to having 16 HW threads running.

Won't the instructions already be in the I$ most of the time? It is just that they are not
referenced by ALT-PC.

>
> So a question is how to generate useful work for those HW threads.
> Prefetching clusters might be an answer.

MitchAlsup

7 Oct 2023, 18:43:32
In this thread we are anticipating issuing 6-8 instructions per cycle. Under these circumstances
prediction happens prior to Issue; branch resolution cleans up the data-flow later. {{The 1-wide in-
order µArchitecture would do as you suggest, but you see, after it fetches way far ahead,
it can use that I$ BW to prefetch branch targets it has not yet decoded (scan ahead)}}.
<
Once you get to 6-wide you have to be fetching and alternate fetching simultaneously.
Once you get to 8-wide you might have to start doing 2 predictions per cycle....

EricP

10 Oct 2023, 13:23:49
MitchAlsup wrote:
> On Saturday, October 7, 2023 at 5:18:53 PM UTC-5, robf...@gmail.com wrote:
>> On Saturday, October 7, 2023 at 4:31:36 PM UTC-4, MitchAlsup wrote:
>>> On Saturday, October 7, 2023 at 1:46:25 PM UTC-5, robf...@gmail.com wrote:
>>> <
>>>> The primary stream I$ is likely to use up most of the available fetch spots. What
>>>> if there is another branch in ALT-I$? Is there going to be yet another ALT2-I$ cache?
>>>> How many alternate branch paths are going to be supported?
>>> <
>>> I expect the I cache to feed the sequential nature of instruction flow and the ALT cache
>>> to service the excursions from the sequential flow. What I don't want is to add cycles
>>> between the fetches and the delivery to instruction queues. My 1-wide machines has
>>> 2 cycles between instruction being flopped after SRAM sense amplifiers and being
>>> issued into execution. I want this 6-wide machine to have only 1 more::
>>> <
>>> FETCH--BUFFER--CHOOSE--DECODE--ISSUE--Execute--CACHE --ALIGN--WAIT --WRITE
>>> The output of Choose is 6 instructions,
>>> The output of Issue is 6 6instructions with register and constant operands at their
>>> .....respective instruction queue.
>> At some point a selection between I$ and ALT-I$ clusters must be made. Is this the purpose
>> of choose? I would think the choice would be made after EXECUTE once the branch
>> outcome is determined.
> <
> In this thread we are anticipating 6-8-wide instructions per cycle. Under these circumstances
> prediction prior to Issue, branch resolution cleans up the data-flow later. {{The 1-wide in
> order µArchitecture would do as you suggest, but you see after it fetches way far ahead,
> it can use that I $ BW to prefetch branch targets it has not yet decoded (scan ahead)}}.
> <
> Once you get to 6-wide you have to be fetching and alternate fetching simultaneously.
> Once you get to 8-wide you might have to start doing 2 predictions per cycle....

A 6 instruction fetch, say average 5 bytes each = 30 bytes per fetch.
Let's call it 32 bytes. So I$L1 needs to feed fetch an aligned 32-byte
fetch block per clock to sustain fetching with no bubbles.

We also have to deal with the I$L1 read access latency.
Let's say I$L1 has a pipelined latency of 3 and throughput of 1.
To not stall, fetch must request an I$L1 block 3 cycles before it needs it.
If a branch wants to change from sequential to alternate path and not stall
we need both blocks in the fetch buffers at the same instant.
If we wait until we parse the branch instruction to begin the I$L1 read
then it is too late.

We need a _Fetch Block Prefetch Predictor_ as the start of alternate path
fetching has to propagate backwards 4 clocks such that we can switch from
sequential to alternate if we want to without missing a clock or even a lane.

If the I$L1 has dual even-odd banks then on average it can supply
two 64-byte block pairs per clock. That should be more than enough to
continuously feed the fetch unit.

A Fetch Block Prefetch Predictor (FBPP) could work by having fetch attach an
alternate fetch block physical address to the block it fetched 4 clocks ago,
and store this in the I$L1 cache line, for each of the low and high blocks.
Note this is a prefetch hint to fetch that in 3 clocks the fetch direction
might change. The prefetched buffer address has to be verified as correct
when the branch instruction is seen and its target virtual address is
calculated and translated.


MitchAlsup

10 Oct 2023, 14:46:36
Consider that the IP Adder creates IP+some-power-of-2 and IP+16+some-
power-of-2. Now we are in a position to fetch 8 words as::
a) 0-1
b) 1-2
c) 2-3
d) 3-0 of the next IP
So, for an additional carry chain, you suffer no gross alignment problems.
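Put another way, the two sums let fetch name two adjacent quarter-line chunks
no matter where IP falls. A C sketch of the index generation, assuming
64-byte lines split across four 4-word banks (one quarter each):

#include <stdint.h>

enum { QUARTER_BYTES = 16, QUARTERS_PER_LINE = 4, LINE_BYTES = 64 };

typedef struct { unsigned bank; uint64_t row; } BankIndex;

/* An 8-word fetch starting at ip reads two consecutive quarters; a
   quarter that wraps past the end of the line uses the IP+line sum
   (the extra carry chain), so alignment never costs a cycle. */
void fetch8_bank_indices(uint64_t ip, BankIndex out[2])
{
    unsigned q   = (unsigned)((ip / QUARTER_BYTES) % QUARTERS_PER_LINE);
    uint64_t row = ip / LINE_BYTES;

    for (int i = 0; i < 2; i++) {
        out[i].bank = (q + i) % QUARTERS_PER_LINE;
        out[i].row  = row + (q + i) / QUARTERS_PER_LINE;  /* +1 if wrapped */
    }
}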
>
> We also have to deal with the I$L1 read access latency.
> Lets say I$L1 has a pipelined latency of 3 and throughput of 1.
<
Then you made it too big !! And why should each SRAM in a cache line
not be accessible independently (that is, throughput = 4 or some higher power
of 2) ?
<
> To not stall fetch must request a I$L1 block 3 cycles before it needs it.
> If a branch wants to change from sequential to alternate path and not stall
> we need both blocks in the fetch buffers at the same instant.
<
Obviously
<
> If we wait until we parse the branch instruction to begin the I$L1 read
> then it is too late.
<
Obviously--which is why we scan ahead or use some predictor to identify
the branches before we get to them.
>
> We need a _Fetch Block Prefetch Predictor_ as the start of alternate path
> fetching has to propagate backwards 4 clocks such that we can switch from
> sequential to alternate if we want to without missing a clock or even a lane.
<
The other train of thought I have been working on is that we have a multiplicity
of successor addresses that are present when the sequential and Alternate
instructions are available; and a predictor predicts which one to use as
sequential and which one to use as ALT.
>
> If the I$L1 has dual even-odd banks then on average it can supply
> two 64-byte block pairs per clock. That should be more than enough to
> continuously feed the fetch unit.
<
L1 will have 4 blocks per tag; each block contains 16 bytes.
>
> A Fetch Block Prefetch Predictor (FBPP) could work by having fetch attach an
> alternate fetch block physical address to the block it fetched 4 clocks ago,
> and store this in the I$L1 cache line, for each of the low and high blocks.
<
The trick is getting these things started. After they are started, they tend to work rather well.

MitchAlsup

10 Oct 2023, 15:07:07
I need to elaborate here::
<
A D$ may have a 3-cycle latency and 1 throughput per bank but this has
logic in it that the I$ does not need--that is the byte alignment logic
that takes the 3rd cycle. I$ can deliver aligned chunks in 2 cycles of
the same metrics.
<
The only thing getting in the way, here, of $ size is wire delay. Data/Inst/Tag
SRAMs only need the lower order ~16(-3) bits of address which comes out
of the adder at least 4 gate delays before the final HOB resolves. Thus, the
addresses reach SRAM flip-flops by clock edge. SRAM access takes 1 full
clock. Tag comparison (hit) output multiplexing, byte alignment, and result
bus drive take the 3rd cycle. Over in I$ land, output multiplexing chooses
which data goes into the Inst Buffer, while the IB is supplying instructions
into PARSE/DECODE. While in IB, instructions are scanned for branches and
their targets fetched (this is the 1-wide machine)--in the 6-wide machine we
are getting close to the point where in spaghetti codes all instructions might
come from the ALT cache. Indeed, the Mc 88120 had it packet cache arranged
to supply all instructions and missed instructions flowed through the C$. Here
we used a 4-banked D$ and a 6-ported C$ to deal with all memory ordering
problems and would only deliver LDs that had resolved all dependencies.
<
BUT DRAM was only 5 cycles away (50ns) so the smallish caches and lack
of associativity did not hurt all that much. This cannot be replicated now
mainly due to wire delay and balancing the several hundred cycles to DRAM
with L2 and L3 caching.

robf...@gmail.com

11 Oct 2023, 00:46:00
Can some parsing be done at the fetch stage, if the encoding is really simple? I have
tried this before in a simple pipelined CPU and it seemed to work, but it was with
single cycle cache latency. I am not sure what the impact on timing would be. Maybe
too many gates? It could be used to chop one cycle off the latency.

Rather than use a hardware fetch predictor would it be possible to use a static
compiler generated prediction? Seems to me that a “JMP Alt-target(s)” instruction
could be in the instruction stream sometime before the branch instructions it is
associated with. If it could be placed enough instructions before the branch it may
be possible to hide the latency. JMP-Alt could begin fetch from multiple targets,
then have a following branch select which Alt path is the correct one.

MitchAlsup

11 Oct 2023, 16:46:09
The encoding means 40 gates and 4 gates of delay give you a pointer to the next
instruction, a pointer to any displacement, and a pointer to any immediate. It is
cheap enough that you can afford to perform it prior to set-selection out of the
I$.
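As a hypothetical sketch only (the real My 66000 length decode depends on
opcode fields not reproduced here), the parse step amounts to something like
the following, where the "pointers" are simply word offsets:

#include <stdint.h>

/* Placeholder: the real mapping from the first word of an instruction
   to the number of trailing constant words (0..2) is ISA-specific. */
static unsigned constant_words(uint32_t first_word)
{
    (void)first_word;
    return 0;
}

typedef struct {
    unsigned next;   /* word offset of the next instruction            */
    unsigned konst;  /* word offset of a trailing constant, 0 if none  */
} InstPointers;

InstPointers parse_one(const uint32_t *words, unsigned at)
{
    unsigned k = constant_words(words[at]);
    InstPointers p = { at + 1 + k, k ? at + 1 : 0 };
    return p;
}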
>
> Rather than use a hardware fetch predictor would it be possible to use a static
> compiler generated prediction? Seems to me that a “JMP Alt-target(s)” instruction
> could be in the instruction stream sometime before the branch instructions it is
> associated with. If it could be placed enough instructions before the branch it may
> be possible to hide the latency. JMP-Alt could begin fetch from multiple targets,
> then have a following branch select which Alt path is the correct one.
<
The entire raison d'être of the ISA is that it requires fundamentally fewer instructions
than other RISC ISAs. My 66000 only requires 70% of the number of instructions that
RISC-V requires. Adding a prepare-to-branch instruction would ruin the density.

MitchAlsup

12 Oct 2023, 14:09:23
It occurs to me that I could invent a "flow" Cache--such a cache would supply
2 (or 3) indexes to the I$ and some "count" information--perhaps the number
of words consumed at each index and the kind of flow between the I$ fetch
and the ALT$ fetch (or perhaps 2 fetches to the I$).
<
The flow cache would have the property that it is small enough and fast enough
so that it could index itself on the successive cycle, while spewing off indexes
to the I$. Each I$ index is associated with the number of words to fetch on that
path, then the instruction buffer collects the 6-inst to enter decode and issue.
<
Since the fetches are in words and instructions are in words, there is no loss in
storage density (except for the flow cache itself) and when it is hitting we have
more than enough instructions each cycle to fill up the execution window easily.
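A flow-cache entry might then hold little more than the following (a C
sketch; the widths and the number of legs per entry are placeholders):

#include <stdint.h>

typedef struct {
    uint16_t icache_index;  /* set/column/quarter index into the I$   */
    uint8_t  words;         /* words consumed on this leg             */
    uint8_t  kind;          /* 0=sequential, 1=branch, 2=call, 3=ret  */
} FlowLeg;

typedef struct {
    uint32_t tag;           /* fetch address this flow was built for  */
    FlowLeg  leg[3];        /* the 2 (or 3) fetches for one cycle     */
    uint16_t next_flow;     /* index of the flow entry for next cycle */
    uint8_t  state;         /* confidence / replacement state         */
} FlowEntry;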
<
Now, off to figure out how to back up from a misprediction in zero cycles.....

EricP

17 Oct 2023, 14:42:54
MitchAlsup wrote:
> I am looking for modern papers on the process of fetching
> non-linear sets of instructions where the first instructions
> in an ISSUE group come from a sequential long fetch and
> other instructions arrive from some other mechanism {such
> as a branch target cache}.

This gets difficult to follow because the word "prefetch" has been
overloaded so many times. My focus is on the instruction queue that
directly feeds the fetch-parse unit, which used to be called the
instruction prefetch queue but lately seems to be called by some
the Fetch Target Queue (FTQ).
Most uses for the term prefetch are D$ and I$ cache prefetching -
the monitoring, predicting and generating addresses to the cache
to trigger L1/L2 prefetch (also called "address generation" or AGEN).

A decoupled front end uses a Fetch Target Queue:

Fetch Directed Instruction Prefetching, 1999
https://web.eecs.umich.edu/~taustin/papers/MICRO32-fdp.pdf

then using the BTB to drive the FTQ:

A Scalable Front-End Architecture for Fast Instruction Delivery, 1999
http://www.cs.cmu.edu/afs/cs/academic/class/15740-f03/public/doc/discussions/uniprocessors/technology/scalable-front-end-isca99.pdf

Branch Target Buffer Organizations, 2023
https://hal.science/hal-04234792/

Shooting Down the Server Front-End Bottleneck, 2022
https://folk.ntnu.no/rakeshk/pubs/Shotgun_TOCS.pdf
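The common shape in these is a BTB/branch-prediction engine running several
blocks ahead and pushing predicted fetch-block addresses into the FTQ, which
the fetch engine (and the I$ prefetcher) then drain. Schematically, as a
sketch and not any particular paper's design:

#include <stdint.h>

#define FTQ_DEPTH 16

typedef struct {
    uint64_t addr[FTQ_DEPTH];   /* predicted fetch-block addresses */
    unsigned head, tail;
} FetchTargetQueue;

/* Prediction side: runs ahead and may fill the queue. */
int ftq_push(FetchTargetQueue *q, uint64_t fetch_block)
{
    unsigned next = (q->tail + 1) % FTQ_DEPTH;
    if (next == q->head) return 0;          /* queue full */
    q->addr[q->tail] = fetch_block;
    q->tail = next;
    return 1;
}

/* Fetch side: an empty queue means a fetch bubble this cycle. */
int ftq_pop(FetchTargetQueue *q, uint64_t *fetch_block)
{
    if (q->head == q->tail) return 0;
    *fetch_block = q->addr[q->head];
    q->head = (q->head + 1) % FTQ_DEPTH;
    return 1;
}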




EricP

17 Oct 2023, 15:18:23
Notice that these use a huge amount of resources to get the BTB
to generate the next fetch block address with sufficient lead time
for fetch to have zero bubbles.

I suggest this might be done more easily by attaching it to the 32B block
in the I$L1 cache line fetched 4 clocks earlier. It just requires bits to
hold a Valid bit, an AltOnly bit, and the alt physical block address.
If AltOnly=0 it would prefetch on both the sequential and alternate paths,
or if AltOnly=1 then just follow alt physical address and proceed sequential.
Cost is maybe 120 extra bits per cache line, one set for each 32B block.
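In struct form the per-block annotation is roughly this (a sketch; the exact
physical-address width is what drives the ~120-bit estimate):

#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid;
    bool     alt_only;      /* 1: follow only the alt path; 0: prefetch both */
    uint64_t alt_block_pa;  /* physical address of the alternate 32B block   */
} AltHint;

typedef struct {
    uint8_t data[64];       /* the cache line proper        */
    AltHint hint[2];        /* one hint per 32B fetch block */
} ICacheLineWithHints;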


EricP

17 Oct 2023, 17:05:48
Ok. I thought the I$L1 read pipeline would always be 3 stages
as it didn't look like there would be enough slack in stage 3 to do
much of anything else (that being instruction alignment and parse).
- cache address decode, latch
- word line drive, bit line drive, sense amp, latch
- tag match, way select mux, drive to prefetch buffer, latch

So just following my train of thought, it might as well read a whole
cache line into the prefetch buffers for each cache access. Out of the
prefetch buffer it reads one or two 32B blocks for consecutive
virtual addresses.

The 1 or 2 blocks go to the block aligner which uses the FetchRIP low
address bits to shift from 0 to 7 instruction words.

That result feeds to the Log2 Parser which selects up to 6 instructions
from those source bytes.

Fetch Line Buf 0 (fully assoc index)
Fetch Line Buf 1
Fetch Line Buf 2
Fetch Line Buf 3
v v
32B Blk1 32B Blk0
v v
Alignment Shifter 8:1 muxes
v
Log2 Parser and Branch Detect
v v v v v v
I5 I4 I3 I2 I1 I0

> <
> The only thing getting in the way, here, of $ size is wire delay. Data/Inst/Tag
> SRAMs only need the lower order ~16(-3) bits of address which comes out
> of the adder at least 4 gate delays before the final HOB resolves. Thus, the
> addresses reach SRAM flip-flops by clock edge. SRAM access takes 1 full
> clock. Tag comparison (hit) output multiplexing, byte alignment, and result
> bus drive take the 3rd cycle. Over in I$ land, output multiplexing chooses
> which data goes into the Inst Buffer, while the IB is supplying instructions
> into PARSE/DECODE. While in IB, instructions are scanned for branches and
> their targets fetched (this is the 1-wide machine)--in the 6-wide machine we
> are getting close to the point where in spaghetti codes all instructions might
> come from the ALT cache. Indeed, the Mc 88120 had it packet cache arranged
> to supply all instructions and missed instructions flowed through the C$. Here
> we used a 4-banked D$ and a 6-ported C$ to deal with all memory ordering
> problems and would only deliver LDs that had resolved all dependencies.

That was one of the reasons for suggesting a 32B block as the fetch unit.
Your My66k instruction words (IWd) are 4 bytes, aligned, and instructions
can be 1 to 3 IWd long. If the average is 5 bytes per instruction then a
32B block holds about 6 instructions, which is also about a basic block
size and on average would contain 1 branch.

So by attaching an optional next-alt-fetch physical address to each
32B block in a 64B cache line, then it can specify both the sequential
and alt fetch path start to follow.

That allows 1 or 2 Fetch Line Bufs to be loaded with the alt path
so when the Log2 Parser sees a branch it can immediately switch to
parsing instructions from the alt path on the next clock.
If two cache lines from the alt path are loaded, that's 4 prefetch blocks
which should be enough to cover the I$ read pipeline latency.

The BTB and Return Stack Predictor are also providing virtual address
prefetch predictions, so that has to be integrated in this too.

robf...@gmail.com

17 Oct 2023, 18:15:32
What if only some alt paths are supported? Could alt paths be available only for
forward branches? How does the number of alt paths supported impact
performance?

How does the alt-path get set in the fetch block? When things are cold the alt path
may not be set correctly. So, it would need to be updated in the I$; I$ line invalidate
required then? I have thought of placing the alt paths for the top four alternate
paths in a function header rather than for every block in the function. It would
reduce the I$ storage requirements. Then fetch from all four paths and use a
branch instruction to select the path. It would mean that only a limited number of
alt paths are supported but for many small functions a small number may be
adequate.


MitchAlsup

17 Oct 2023, 21:00:24
Essentially all loops will fit in the IBuffer of this GBOoO machine.
Any short forward branch can be treated as a non-branch, for fetch purposes.
CALL/RET targets may be supported by a different "predictor".
Jump indirect will have its own "predictor".
<
> Could alt paths be available only for forward branches?
<
Alt path is a port reduction scheme on the I$
<
> How does the number of alt paths supported impact performance?
<
During execution you are going to want ~8-16 (especially when switch()s are only
1 flow control instruction--rather than 2 or 3).
As an IPC support module, you probably want at least 256.
But you want them to be independent of the size and associativity of I$
<
>
> How does the alt-path get set in the fetch block? When things are cold the alt path
> may not be set correctly.
<
may not !?!?!?!?! I expect predicting compulsory misses is completely futile.
However, I expect that predicting capacity/collision misses might be useful.
<
> So, it would need to be updated in the I$;
<
Not necessarily--flow prediction will be a cache all by itself--and it predicts only
the fetch indexes--and "hits" are checked during decode.
<
> I$ line invalidate required then?
<
You want the actual executed instructions to come from cache(s) that get invalidated
when I$ no longer holds that instruction--{so, remote invalidates work} {{Packet caches
and trace caches have these problems.}}
<
>I have thought of placing the alt paths for the top four alternate
> paths in a function header rather than for every block in the function.
<
Function header is not present on 1-wide machines and I want code compatibility.
<
> It would
> reduce the I$ storage requirements. Then fetch from all four paths and use a
> branch instruction to select the path. It would mean that only a limited number of
> alt paths are supported but for many small functions a small number may be
> adequate.
<
An interesting notion.

MitchAlsup

17 Oct 2023, 22:00:38
On Tuesday, October 17, 2023 at 4:05:48 PM UTC-5, EricP wrote:
> MitchAlsup wrote:
> > On Tuesday, October 10, 2023 at 1:46:36 PM UTC-5, MitchAlsup wrote:

> > I need to elaborate here::
> > <
> > A D$ may have a 3-cycle latency and 1 throughput per bank but this has
> > logic in it that the I$ does not need--that is the byte alignment logic
> > that takes the 3rd cycle. I$ can deliver aligned chunks in 2 cycles of
> > the same metrics.
<
> Ok. I thought the I$L1 read pipeline would always be 3 stages
> as it didn't look like there would be enough slack in stage 3 to do
> much of anything else (that being instruction alignment and parse).
<
This is where the 4-gates of delay to create unary next instruction pointers
comes in handy. It can be developed while set comparison is done and is
ready when set-multiplexer select line is ready. You don't have to store the
information in the I$ yet it is ready by the time you first see the bit-patterns.
You can redevelop that data from the instruction bit-pattern at lower cost
than you can store it !!
<
> - cache address decode, latch {flip-flop}
> - word line drive, bit line drive, sense amp, latch {flip-flop}
> - tag match, way select mux, drive to prefetch buffer, latch {flip-flop}
>
> So just following my train of thought, it might as well read a whole
> cache line into the prefetch buffers for each cache access. Out of the
> prefetch buffer it reads one or two 32B blocks for consecutive
> virtual addresses.
<
Cache line is 16-words, ... I can see where being able to do four × 4-word
fetches independently might be more sound, but this depends on the
instruction buffer µarchitecture.
>
> The 1 or 2 blocks go to the block aligner which uses the FetchRIP low
> address bits to shift from 0 to 7 instruction words.
<
Note: next-instruction unary pointers are available here.
>
> That result feeds to the Log2 Parser which selects up to 6 instructions
> from those source bytes.
<
Since next-instruction pointers were already available, one can find
all candidate instruction-specifiers (6-wide) in 6-gates of delay! So, this
PARSE logic can be overlapped with I$ set selection, Instruction Buffer
read-out; leaving DECODE looking at individual instructions for each
DECODE slot.
>
> Fetch Line Buf 0 (fully assoc index)
> Fetch Line Buf 1
> Fetch Line Buf 2
> Fetch Line Buf 3
> v v
> 32B Blk1 32B Blk0
> v v
> Alignment Shifter 8:1 muxes
> v
> Log2 Parser and Branch Detect
> v v v v v v
> I5 I4 I3 I2 I1 I0
<
Yes, pretty much.
<
> > <
> > The only thing getting in the way, here, of $ size is wire delay. Data/Inst/Tag
> > SRAMs only need the lower order ~16(-3) bits of address which comes out
> > of the adder at least 4 gate delays before the final HOB resolves. Thus, the
> > addresses reach SRAM flip-flops by clock edge. SRAM access takes 1 full
> > clock. Tag comparison (hit) output multiplexing, byte alignment, and result
> > bus drive take the 3rd cycle. Over in I$ land, output multiplexing chooses
> > which data goes into the Inst Buffer, while the IB is supplying instructions
> > into PARSE/DECODE. While in IB, instructions are scanned for branches and
> > their targets fetched (this is the 1-wide machine)--in the 6-wide machine we
> > are getting close to the point where in spaghetti codes all instructions might
> > come from the ALT cache. Indeed, the Mc 88120 had it packet cache arranged
> > to supply all instructions and missed instructions flowed through the C$. Here
> > we used a 4-banked D$ and a 6-ported C$ to deal with all memory ordering
> > problems and would only deliver LDs that had resolved all dependencies.
<
> That was one of the reasons for suggesting a 32B block as the fetch unit.
<
It is certainly correct within a power of 2.
<
> Your My66k instructions words (IWd) are 4 bytes, aligned, and instructions
> can be 1 to 3 IWd long. If the average is 5 bytes per instruction then a
> 32B block holds about 6 instructions, which is also about a basic block
> size and on average would contain 1 branch.
<
Yes, plus if there is no branch, I can set up the ALT index to be the sequential
index and cover all sequential instruction sequences.
<
My statistics generator says that 60.8% of branches are taken and branches
are 19.2% of the instruction stream, so one takes a branch every 9.17 instructions
{hand waving}. So, between 8 and 10 words is the fetch length of the sequential
fetch, and another 4 on the alternate fetch should cover everything except
branch density spikes.
>
> So by attaching an optional next-alt-fetch physical address to each
> 32B block in a 64B cache line, then it can specify both the sequential
> and alt fetch path start to follow.
<
Change/attaching/into/associating/ and we are on the same page.
Also change/address/into/index/ so we store fewer bits in the flow cache.
>
> That allows 1 or 2 Fetch Line Bufs to be loaded with the alt path
> so when the Log2 Parser sees a branch it can immediately switch to
> parsing instructions from the alt path on the next clock.
<
Must query the branch predictor to decide which path to follow.
<
> If two cache lines from the alt path are loaded, that's 4 prefetch blocks
> which should be enough to cover the I$ read pipeline latency.
<
Here is where 16×¼ cache lines will be superior.
>
> The BTB and Return Stack Predictor are also providing virtual address
> prefetch predictions, so that has to be integrated in this too.
<
So 1991 as a design point.

EricP

18 Oct 2023, 14:21:20
For alt paths to be prefetched requires some kind of trigger.
Current systems seem to use the BTB as a trigger and, from what I read,
this can make the BTB structures quite large.
For zero bubble fetching the trigger has to be
set to go off far enough ahead to hide the bubble.
More paths, farther ahead, means larger structures, slower access.

> How does the alt-path get set in the fetch block? When things are cold the alt path
> may not be set correctly. So, it would need to be updated in the I$; I$ line invalidate
> required then?

Existing Intel and AMD x86 designs have fetch marking instructions
with parser start and stop bit, so there are pathways to back propagate
from fetch-parse to I$ in current designs.

This might be done as follows: when fetch reads an I$ [row,set,block] entry,
the coordinates come along with the instruction packet, and we save them in fetch.
This saves going through the index and tag structures to update that entry.

Fetch could keep a 4-entry circular buffer with the [row,set,block] of
32B blocks it just fetched from. It would just write physical address
of the current block to the auxiliary fields of block 4 clocks ago.

This update would be stored in a pending update buffer in cache
and written back when there was an unused I$ read or write cycle.
(And if the cache data area is also write pipelined then multiple pending
update buffers and coordination logic would be required).

This might imply frequent write updates to I$ cache.
The problem is finding an unused I$ cycle. If we are fetching both
sequential and possibly 1 alternate path then that might fully saturate
the I$ access bandwidth. And there is also prefetch writes to load
cache lines into I$L1 from I$L2.
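The bookkeeping might look roughly like this C sketch (the 4-entry depth and
the single pending-update buffer are just the guesses from above):

#include <stdbool.h>
#include <stdint.h>

typedef struct { uint16_t row; uint8_t set; uint8_t blk; } ICoord;

typedef struct {
    ICoord   where;          /* whose alt-hint field to update */
    uint64_t alt_block_pa;   /* value to write into it         */
    bool     pending;
} HintUpdate;

static ICoord     recent[4];   /* coordinates of the last 4 fetched blocks */
static unsigned   rptr;
static HintUpdate pending_update;

/* Called each time fetch reads a 32B block: queue an update that writes
   this block's physical address into the hint field of the block fetched
   4 clocks ago (the entry about to be overwritten). */
void on_fetch_block(ICoord where, uint64_t block_pa)
{
    pending_update = (HintUpdate){ recent[rptr], block_pa, true };
    recent[rptr] = where;
    rptr = (rptr + 1) % 4;
}

/* Drained whenever the I$ has an otherwise unused access cycle. */
void on_idle_icache_cycle(void)
{
    if (pending_update.pending) {
        /* icache_write_hint(pending_update.where, pending_update.alt_block_pa); */
        pending_update.pending = false;
    }
}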

> I have thought of placing the alt paths for the top four alternate
> paths in a function header rather than for every block in the function. It would
> reduce the I$ storage requirements. Then fetch from all four paths and use a
> branch instruction to select the path. It would mean that only a limited number of
> alt paths are supported but for many small functions a small number may be
> adequate.

The way I'm thinking, fetch buffers would be an expensive resource,
fully assoc index, really a small I$L0 cache, so the buffers would
be recycled quite quickly and not be prefetched too far ahead.



MitchAlsup

18 Oct 2023, 19:11:03
When big enough to cover the latency from arrival to insertion into reservation
station, you have enough buffers.

EricP

19 Oct 2023, 09:24:45
MitchAlsup wrote:
> On Wednesday, October 18, 2023 at 1:21:20 PM UTC-5, EricP wrote:

By the way, this article talks a bit about AMD and zero bubble fetching.

https://chipsandcheese.com/2023/10/08/zen-5s-leaked-slides/

They don't define whether "zero bubble" means zero clock bubbles or
zero instruction bubbles; the difference matters when one is decoding
4 instructions/clock.


MitchAlsup

19 Oct 2023, 14:08:15
Zero bubble means taken branches appear to take 1 cycle.
<
And AMD's front end seems to follow my general guestimate of how
I want to push forward.
<
Thanks EricP

MitchAlsup

5 Dec 2023, 16:46:22
EricP wrote:

> MitchAlsup wrote:
>>
> That result feeds to the Log2 Parser which selects up to 6 instructions
> from those source bytes.

> Fetch Line Buf 0 (fully assoc index)
> Fetch Line Buf 1
> Fetch Line Buf 2
> Fetch Line Buf 3
> v v
> 32B Blk1 32B Blk0
> v v
> Alignment Shifter 8:1 muxes
> v
> Log2 Parser and Branch Detect
> v v v v v v
> I5 I4 I3 I2 I1 I0

Been thinking of this in the background the last month. It seems to me that
a small fetch-predictor is in order.

This fetch-predictor makes use of the natural organization of the ICache as
a matrix of SRAM macros (of some given size:: say 2KB) each SRAM macro having
a ¼ Line access width. Let us call this the horizontal direction. In the
vertical direction we have sets (or ways if you prefer).

Each SRAM macro (2KB) is 128 bits wide by 128 words deep, so we need a 7-bit index.
Each SRAM column has {2,3,4,...} SRAM macros {4=16KB ICache}; so we need
{2,3,..}-bits of set-index.

Putting 4 of these index sets together gives us a (7+3)×4 = 40-bit fetch-
predictor entry, add a few bits for state and control. {{We may need to
add a field used to access the fetch-predictor for the next cycle}}.

We are now in a position to access 4×¼ = 1 cache line (16 words) from the
matrix of SRAM macros.

Sequential access:
It is easy to see that one can access 16 words (16 potential instructions)
in a linear sequence even when the access crosses a cache line boundary.

Non-sequential access:
Given a 6-wide machine (and known instruction statistics wrt VLE utilization)
and the assumption of 1 taken branch per issue-width:: the fetch-predictor
accesses 4 SRAM macros indexing the macro with the 7-bit index, and choosing
the set from the 3-bit index. {We are accessing a set-associative cache as if
it were directly mapped.}

Doubly non-sequential access:
There are many occurrences where there are a number of instructions on the
sequential path, a conditional branch to a small number of instructions on
the alternate path ending with a direct branch to somewhere else. We use
the next fetch-predictor access field such that this direct branch does not
incur an additional cycle of fetch (or execute) latency. This direct branch
can be a {branch, call, or return}

Ramifications:
When instructions are written into the ICache, they are positioned in a set
which allows the fetch-predictor to access the sequential path of instructions
and the alternate path of instructions.

All instructions are always fetched from the ICache, which has been organized
for coherence by external SNOOP activities, so there is minimal excess state
and no surgery at context switching or the like.

ICache placement ends up dependent on the instructions being written in accord
with how control flow arrived at this point (satisfying the access method
above).

This organization satisfies several "hard" cases::

a) 3 ST instructions each 5 words in size: the ICache access supplies 16 words;
all 4×¼ accesses are sequential but may span cache line boundaries and set
placements. These sequences are found in subroutine prologue setting up local
variables with static assignments on the stack. The proposed machine can only
perform 3 memory references per cycle, so this seems to be a reasonable balance.

b) One can process sequential instructions up to a call and several instructions
at the call-target in the same issue cycle. The same can transpire on return.

c) Should a return find a subsequent call (after a few instructions) the EXIT
instruction can be cut short and the ENTER instruction cut short because all
the preserved registers are already where they need to be on the call/return
stack; taking fewer cycles wandering around the call/return tree.


So:: the fetch-predictor contains 5 accesses, 4 to ICache of instructions and
1 to itself for the next fetch-prediction.

{ set[0] column[0] set[1] column[1] set[2] column[2] set[3] column[3] next}
| +-------+ | +-------+ | +-------+ | +-------+ | +-------+
| | | +-->| | | | | | | | +->| |
| +-------+ +-------+ | +-------+ | +-------+ +-------+
+--> | | | | | | | | | |
+-------+ +-------+ | +-------+ | +-------+
| | | | +--> | | +--> | |
+-------+ +-------+ +-------+ +-------+
| | | | | | | |
+-------+ +-------+ +-------+ +-------+
| | | |
V V V V
inst[0] inst[1] inst[2] inst[3]

The instruction groups still have to be "routed" into some semblance of order
but this can take place over the 2 or 3 decode cycles.

All of the ICache tag checking is performed "later" in the pipeline, taking
tag-check and selection multiplexing out of the instruction delivery path.
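Packed into a struct, one fetch-predictor entry is just the following (a
sketch; the widths are the ones worked out above):

#include <stdint.h>

typedef struct {
    uint8_t  macro_index[4];  /* 7-bit word index into each SRAM column   */
    uint8_t  set_index[4];    /* 2-3 bit set select for each column       */
    uint16_t next;            /* fetch-predictor entry to read next cycle */
    uint8_t  state;           /* a few state and control bits             */
} FetchPredEntry;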